Poster: brewster Date: May 11, 2004 11:12pm
Forum: petabox Subject: Re: Selective powering of large petasites

Richard and "illtud"--

Thank you for the notes. We have had some experience with both tapes and hard drives at the Internet Archive and the Television Archive, and all of it points to the same solution: keep multiple copies, and keep them as active as possible.

The Television Archive, which has holdings closing in on a petabyte, started on tape and is now recording onto hard drives that are kept offline. We don't have much experience reading this collection back, except for the recordings from Sept 11, 2001 through Sept 18, and that all worked fine.

A bigger tape experiment was trying to read 1000 DLT tapes recorded by the Internet Archive from 1996-1999. Faults made some tapes difficult to read, and a limited amount of data was lost. It was also very slow: reading them took months of an administrator's time. Since then, all data has been recorded onto hard drives that are kept online.

Spinning disks seem to have a failure rate of 6% per year, but we are working on better measurements. When a disk "fails" it does not always lose data, and sometimes it loses only one block, so recovery can be effective. But this means we should not keep only one copy.

Our data protection system is to have at least 2 copies and preferably in distant locations (we have found that human error accounts for real loss as well, so having different administrative bodies helps). We keep copies in San Francisco and at the Library of Alexandria in Egypt.
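
As a rough illustration of why a second copy matters, here is a minimal sketch of the arithmetic. It takes the 6% per year figure mentioned above and assumes (these are simplifications, not measurements) that copies fail independently and that a failed copy is not replaced during the year. The function name is made up for this example.

```python
def loss_probability(annual_failure_rate: float, copies: int) -> float:
    """Chance that every copy sits on a disk that fails within the same year.

    Assumes independent failures and no re-copying during the year.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    return annual_failure_rate ** copies


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} copies: {loss_probability(0.06, n):.4%}")
```

With one copy the figure is simply 6%; with two it is 0.06 x 0.06 = 0.36%. In practice this overstates the risk, since a failed disk does not always lose data and a lost copy can be rebuilt from the surviving one, but it understates it wherever failures are correlated (the same operator error hitting both copies, for instance).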

We are developing the petabox for exactly this reason. It is designed from the bottom up for reliability, low power, and low cost. The low cost means that we can have 2 or more copies of even large datasets.

I am in Europe for the next 2 months setting up a European Internet Archive that will host those machines in Amsterdam. I would be glad to talk with anyone who is interested in what we are doing.

I can be reached directly at brewster (at)

Digital Librarian

Poster: JTW Date: May 12, 2004 4:42pm
Forum: petabox Subject: Re: Selective powering of large petasites

As you’ve been saying, once you go beyond a couple of terabytes of data, most of it is rarely accessed again, or in some cases never again. This is one of the problems I see with very large databases. We have massive servers running 24/7 where only about 1% of the stored data is used, and 97% of that data was added in the last 3-6 months. But because of the database software we’re running (and a management decision), all the data is stored, always powered on, in one large database. As for the part that might be of interest to you: we also store reports long-term.

I’ve been looking into a system that has “on demand wake up” functionality for these reports. The “computers”, and more importantly the hard drives, spend most of their time turned off in a sleep mode, i.e. actually off, with a network signal to the BIOS bringing them back to life when required. For large archives this could save thousands in power costs, reduce heat problems, and should cut down hard drive failure rates. From a management point of view, if it’s possible to figure out in advance which data is going to be accessed least, you place it in this long-term storage system, while keeping the more highly demanded data in an always-on subsystem.
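
One common form of such a network wake-up signal is Wake-on-LAN, and a minimal sketch of it follows. It assumes the sleeping machine's network card and BIOS have Wake-on-LAN enabled; whether that is the exact mechanism meant above is an assumption. The "magic packet" is 6 bytes of 0xFF followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. The function names, the default broadcast address, and port 9 are choices made for this example.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # 6 bytes of 0xFF, then the MAC repeated 16 times (102 bytes in total).
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str,
                      broadcast: str = "255.255.255.255",
                      port: int = 9) -> None:
    """Broadcast the magic packet on the local network over UDP."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling send_magic_packet("00:11:22:33:44:55") (a placeholder address) from the always-on machine would be the "wake" step; the packet generally does not cross routers, so the sender needs to be on the same network segment as the sleeping box.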

From a topology point of view everything seems to be online 24/7, but in actuality it’s the requests for data that determine which systems are currently powered up. I’m on the prowl to see if anybody else is doing this before I invest time in creating our own solution for feasibility testing with off-the-shelf components and Linux. Initially it would be 4 boxes with a single 100GB drive each (400GB of data in total) and a master controller. This controller will mount all the subsystems with NFS or Samba, depending on which OS works best for hibernation / suspend modes. The idea is that the other computers power up only when someone accesses data in those subdirectories on the master controller. Of course there needs to be some controlling program that knows the location of all the data you have, meaning you can’t just let users start browsing the network looking for files, as the systems would end up starting and stopping every couple of minutes.
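
A minimal sketch of what such a controlling program might look like is below. It is only an illustration under stated assumptions: the class and method names are invented, the wake function is passed in by the caller (it could send a network wake-up signal), and the idle timeout is an arbitrary guard against boxes starting and stopping every couple of minutes. The catalog lives on the master controller, so listing files never wakes anything.

```python
import time
from typing import Callable, Dict, List


class ArchiveController:
    """Keeps a catalog of file locations and wakes a box only on real access."""

    def __init__(self, wake: Callable[[str], None],
                 idle_timeout: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._wake = wake                        # called with a host name
        self._idle_timeout = idle_timeout        # seconds a box stays up unused
        self._clock = clock
        self._location: Dict[str, str] = {}      # file path -> host name
        self._last_used: Dict[str, float] = {}   # powered-up host -> last access

    def register(self, path: str, host: str) -> None:
        self._location[path] = host

    def list_files(self) -> List[str]:
        # Answered from the catalog alone, so browsing wakes no machines.
        return sorted(self._location)

    def request(self, path: str) -> str:
        """Return the host holding `path`, waking it first if it is down."""
        host = self._location[path]
        if host not in self._last_used:
            self._wake(host)
        self._last_used[host] = self._clock()
        return host

    def hosts_to_power_down(self) -> List[str]:
        """Return hosts idle for at least the timeout and mark them as down."""
        now = self._clock()
        idle = [h for h, t in self._last_used.items()
                if now - t >= self._idle_timeout]
        for host in idle:
            del self._last_used[host]
        return sorted(idle)
```

The master would call request() just before reading from the NFS or Samba mount, and call hosts_to_power_down() periodically to decide which boxes to suspend. Actually suspending a box is left out here because it depends on the OS and hardware.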

Poster: brewster Date: May 12, 2004 5:32pm
Forum: petabox Subject: Re: Selective powering of large petasites

The Library of Alexandria has a copy of much of the web collection of the Internet Archive. They run their systems set to sleep after 3 minutes of inactivity, and they report that it works fine. In a separate test, Bruce Baumgart found that it takes 9-10 seconds to spin a disk back up.
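
To get a feel for what those two numbers mean for users, here is a rough sketch that counts how many requests would hit a spun-down disk. It uses the 3-minute timeout and a 10-second spin-up from above; the request times are invented, the disk is assumed to start spun down, and the small shift that the spin-up delay itself adds to the idle clock is ignored.

```python
from typing import Iterable, Tuple


def spin_up_penalty(request_times: Iterable[float],
                    idle_timeout: float = 180.0,
                    spin_up_seconds: float = 10.0) -> Tuple[int, float]:
    """Return (number of spin-ups, total seconds spent waiting on them).

    `request_times` are seconds since some starting point. A request triggers
    a spin-up if the disk has been idle for longer than `idle_timeout`.
    """
    spin_ups = 0
    last_access = None
    for t in sorted(request_times):
        if last_access is None or t - last_access > idle_timeout:
            spin_ups += 1
        last_access = t
    return spin_ups, spin_ups * spin_up_seconds


if __name__ == "__main__":
    # Hypothetical access pattern: two short bursts and one late request.
    count, waited = spin_up_penalty([0, 60, 400, 450, 1000])
    print(count, waited)
```

For this made-up pattern, three of the five requests pay the spin-up delay (30 seconds of waiting in total), while the disk is allowed to sleep through the long gaps.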

We have not done a large-scale test of this approach, but it sounds promising for many applications.

With its disks spun down, the petabox would save half the power.